(54) Information storage and retrieval

(57) An information retrieval system in which a set 
of distinct information items map to respective nodes in 
an array of nodes of a self-organising map by mutual 
similarity of the information items, so that similar infor- 
mation items map to nodes at similar positions in the 
array of nodes comprises: 

a graphical user interface for displaying a represen- 
tation of at least some of the nodes as a two-dimen- 
sional display array of display points within a display 
area on a user display; 



a user control for defining a two-dimensional region 
of the display area; and 

a detector for detecting those display points lying 
within the two-dimensional region of the display ar- 
ea; 

the graphical user interface also displaying a list of 
data representing information items, being those in- 
formation items mapped onto nodes corresponding 
to display points displayed within the two-dimen- 
sional region of the display area. 
Description

[0001] This invention relates to information storage 
and retrieval. 

[0002] There are many established systems for locat- 
ing information (e.g. documents, images, emails, pat- 
ents, internet content or media content such as audio/ 
video content) by searching under keywords. Examples 
include internet search "engines" such as those provid- 
ed by "Google" ™ or "Yahoo" ™ where a search carried 
out by keyword leads to a list of results which are ranked 
by the search engine in order of perceived relevance. 
[0003] However, in a system encompassing a large 
amount of content, often referred to as a massive con- 
tent collection, it can be difficult to formulate effective 
search queries to give a relatively short list of search 
"hits". For example, at the time of preparing the present 
application, a Google search on the keywords "massive 
document collection" drew 243000 hits. This number of 
hits would be expected to grow if the search were re- 
peated later, as the amount of content stored across the 
internet generally increases with time. Reviewing such 
a list of hits can be prohibitively time-consuming. 
[0004] In general, some reasons why massive con- 
tent collections are not well utilised are: 

• a user doesn't know that relevant content exists 

• a user knows that relevant content exists but does 
not know where it can be located 

• a user knows that content exists but does not know 
it is relevant 

• a user knows that relevant content exists and how 
to find it, but finding the content takes a long time 

[0005] The paper "Self Organisation of a Massive
Document Collection", Kohonen et al, IEEE Transactions
on Neural Networks, Vol. 11, No. 3, May 2000, pages
574-585 discloses a technique using so-called
"self-organising maps" (SOMs). These make use of so-called
unsupervised self-learning neural network algorithms in
which "feature vectors" representing properties of each
document are mapped onto nodes of a SOM.
[0006] In the Kohonen et al paper, a first step is to
pre-process the document text, and then a feature vector is
derived from each pre-processed document. In one
form, this may be a histogram showing the frequencies
of occurrence of each of a large dictionary of words.
Each data value (i.e. each frequency of occurrence of a
respective dictionary word) in the histogram becomes a
value in an n-value vector, where n is the total number
of candidate words in the dictionary (43222 in the example
described in this paper). Weighting may be applied
to the n vector values, perhaps to stress the increased
relevance or improved differentiation of certain
words.

[0007] The n-value vectors are then mapped onto
smaller dimensional vectors (i.e. vectors having a
number of values m (500 in the example in the paper)
which is substantially less than n). This is achieved by
multiplying the vector by an (n x m) "projection matrix"
formed of an array of random numbers. This technique
has been shown to generate vectors of smaller dimension
where any two reduced-dimension vectors have
much the same vector dot product as the two respective
input vectors. This vector mapping process is described
in the paper "Dimensionality Reduction by Random
Mapping: Fast Similarity Computation for Clustering",
Kaski, Proc IJCNN, pages 413-418, 1998.

[0008] The reduced dimension vectors are then
mapped onto nodes (otherwise called neurons) on the
SOM by a process of multiplying each vector by a "model"
(another vector). The models are produced by a
learning process which automatically orders them by
mutual similarity onto the SOM, which is generally
represented as a two-dimensional grid of nodes. This is a
non-trivial process which took Kohonen et al six weeks
on a six-processor computer having 800 MB of memory,
for a document database of just under seven million
documents. Finally the grid of nodes forming the SOM is
displayed, with the user being able to zoom into regions
of the map and select a node, which causes the user
interface to offer a link to an internet page containing the
document linked to that node.

[0009] This invention provides an information retrieval
system in which a set of distinct information items map
to respective nodes in an array of nodes by mutual
similarity of the information items, so that similar information
items map to nodes at similar positions in the array of
nodes; the system comprising:

a graphical user interface for displaying a representation
of at least some of the nodes as a two-dimensional
display array of display points within a display
area on a user display;

a user control for defining a two-dimensional region
of the display area; and

a detector for detecting those display points lying
within the two-dimensional region of the display area;

the graphical user interface also displaying a list of
data representing information items, being those
information items mapped onto nodes corresponding
to display points displayed within the two-dimensional
region of the display area.

[0010] The skilled man will realise that in the normal
usage of the word "list", the "data representing information
items" could be the item itself, if it is of a size and
nature appropriate for full display, or could be data
indicative of the item.

[0011] The invention also provides an information
storage system in which a set of distinct information
items are processed so as to map to respective nodes
in an array of nodes by mutual similarity of the information
items, such that similar information items map to
nodes at similar positions in the array of nodes; the system
comprising:

means for generating a feature vector derived from
each information item, the feature vector for an
information item representing a set of frequencies of
occurrence, within that information item, of each of
a group of information features; and

means for mapping each feature vector to a node
in the array of nodes, the mapping between information
items and nodes in the array including a dither
component so that substantially identical information
items tend to map to closely spaced but different
nodes in the array.

The invention builds on
the processes described in the Kohonen et al paper
by providing a user interface conveniently allowing
the user to associate displayed points on the screen
with information items in a list of items, while allowing
the user to distinguish conveniently between
similar information items.

[0012] Further respective aspects and features of the
invention are defined in the appended claims. 
[0013] Embodiments of the invention will now be de- 
scribed, by way of example only, with reference to the 
accompanying drawings in which: 

Figure 1 schematically illustrates an information
storage and retrieval system;
Figure 2 is a schematic flow chart showing the generation
of a self-organising map (SOM);
Figures 3a and 3b schematically illustrate term frequency
histograms;
Figure 4a schematically illustrates a raw feature
vector;
Figure 4b schematically illustrates a reduced feature
vector;
Figure 5 schematically illustrates an SOM;
Figure 6 schematically illustrates a dither process;
Figures 7 to 9 schematically illustrate display
screens providing a user interface to access information
represented by the SOM;
Figure 10 schematically illustrates a camcorder as
an example of a video acquisition and/or processing
apparatus; and
Figure 11 schematically illustrates a personal digital
assistant as an example of portable data processing
apparatus.

[0014] Figure 1 is a schematic diagram of an information
storage and retrieval system based around a
general-purpose computer 10 having a processor unit 20
including disk storage 30 for programs and data, a network
interface card 40 connected to a network 50 such
as an Ethernet network or the Internet, a display device
such as a cathode ray tube device 60, a keyboard 70
and a user input device such as a mouse 80. The system
operates under program control, the programs being
stored on the disk storage 30 and provided, for example,
by the network 50, a removable disk (not shown) or a
pre-installation on the disk storage 30.
[0015] The storage system operates in two general
modes of operation. In a first mode, a set of information
items (e.g. textual information items) is assembled on
the disk storage 30 or on a network disk drive connected
via the network 50 and is sorted and indexed ready for
a searching operation. The second mode of operation
is the actual searching against the indexed and sorted
data.

[0016] The embodiments are applicable to many
types of information items. A non-exhaustive list of
appropriate types of information includes patents, video
material, emails, presentations, internet content, broadcast
content, business reports, audio material, graphics
and clipart, photographs and the like, or combinations
or mixtures of any of these. In the present description,
reference will be made to textual information items, or
at least information items having a textual content or
association. So, for example, a piece of broadcast content
such as audio and/or video material may have associated
"MetaData" defining that material in textual terms.
[0017] The information items are loaded onto the disk
storage 30 in a conventional manner. Preferably, they
are stored as part of a database structure which allows
for easier retrieval and indexing of the items, but this is
not essential. Once the information items have been
so stored, the process used to arrange them for searching
is shown schematically in Figure 2.

[0018] It will be appreciated that the indexed information
data need not be stored on the local disk drive 30.
The data could be stored on a remote drive connected
to the system 10 via the network 50. Alternatively, the
information may be stored in a distributed manner, for
example at various sites across the internet. If the information
is stored at different internet or network sites, a
second level of information storage could be used to
store locally a "link" (e.g. a URL) to the remote information,
perhaps with an associated summary, abstract or
MetaData associated with that link. So, the remotely
held information would not be accessed unless the user
selected the relevant link (e.g. from the results list 260
to be described below), although for the purposes of the
technical description which follows, the remotely held
information, or the abstract/summary/MetaData, or the
link/URL could be considered as the "information item".
[0019] In other words, a formal definition of the "information
item" is an item from which a feature vector is
derived and processed (see below) to provide a mapping
to the SOM. The data shown in the results list 260
(see below) may be the information item itself (if it is held
locally and is short enough for convenient display) or
may be data representing and/or pointing to the information
item, such as one or more of MetaData, a URL,
an abstract, a set of key words, a representative key
stamp image or the like. This is inherent in the operation
"list" which often, though not always, involves listing data
representing a set of items.
[0020] In a further example, the information items
could be stored across a networked work group, such
as a research team or a legal firm. A hybrid approach
might involve some information items stored locally and/or
some information items stored across a local area
network and/or some information items stored across a
wide area network. In this case, the system could be
useful in locating similar work by others: for example, in
a large multi-national research and development organisation,
similar research work would tend to be mapped
to similar output nodes in the SOM (see below). Or, if a
new television programme is being planned, the present
technique could be used to check for its originality by
detecting previous programmes having similar content.
[0021] It will also be appreciated that the system 10 
of Figure 1 is but one example of possible systems 
which could use the indexed information items. Al- 
though it is envisaged that the initial (indexing) phase 
would be carried out by a reasonably powerful computer,
most likely by a non-portable computer, the later
phase of accessing the information could be carried out 
at a portable machine such as a "personal digital assist- 
ant" (a term for a data processing device with display 
and user input devices, which generally fits in one hand), 
a portable computer such as a laptop computer, or even 
devices such as a mobile telephone, a video editing ap- 
paratus or a video camera. In general, practically any 
device having a display could be used for the informa- 
tion-accessing phase of operation. 
[0022] The processes are not limited to particular 
numbers of information items. 

[0023] The process of generating a self-organising 
map (SOM) representation of the information items will 
now be described with reference to Figures 2 to 6. Fig- 
ure 2 is a schematic flow chart illustrating a so-called 
"feature extraction" process followed by an SOM map- 
ping process. 

[0024] Feature extraction is the process of transform- 
ing raw data into an abstract representation. These ab- 
stract representations can then be used for processes 
such as pattern classification, clustering and recogni- 
tion. In this process, a so-called "feature vector" is gen- 
erated, which is an abstract representation of the fre- 
quency of terms used within a document. 
[0025] The process of forming the visualisation 
through creating feature vectors includes: 

• Create "document database dictionary" of terms 

• Create "term frequency histograms" for each indi- 
vidual document based on the "document database 
dictionary" 

• Reduce the dimension of the "term frequency his- 
togram" using random mapping 

• Create a 2-dimensional visualisation of the informa- 
tion space. 

[0026] Considering these steps in more detail, each
document (information item) 100 is opened in turn. At a
step 110, all "stop words" are removed from the document.
Stop-words are extremely common words on a
pre-prepared list, such as "a", "the", "however", "about"
and "and". Because these words are extremely
common they are likely, on average, to appear with similar
frequency in all documents of a sufficient length. For
this reason they serve little purpose in trying to characterise
the content of a particular document and should
therefore be removed.

[0027] After removing stop-words, the remaining
words are stemmed at a step 120, which involves finding
the common stem of a word's variants. For example the
words "thrower", "throws", and "throwing" have the common
stem of "throw".

[0028] A "dictionary" of stemmed words appearing in
the documents (excluding the "stop" words) is main- 
tained. As a word is newly encountered, it is added to 
the dictionary, and a running count of the number of 
times the word has appeared in the whole document col- 
lection (set of information items) is also recorded. 
[0029] The result is a list of terms used in all the documents
in the set, along with the frequency with which
those terms occur. Words that occur with too high or too
low a frequency are discounted, which is to say that they
are removed from the dictionary and do not take part in
the analysis which follows. Words with too low a frequency
may be misspellings, made up, or not relevant
to the domain represented by the document set. Words
that occur with too high a frequency are less appropriate
for distinguishing documents within the set. For example,
the term "News" is used in about one third of all documents
in a test set of broadcast-related documents,
whereas the word "football" is used in only about 2% of
documents in the test set. Therefore "football" can be
assumed to be a better term for characterising the content
of a document than "News". Conversely, the word
"fottball" (a misspelling of "football") appears only once
in the entire set of documents, and so is discarded for
having too low an occurrence. Such words may be defined
as those having a frequency of occurrence which
is lower than two standard deviations less than the mean
frequency of occurrence, or which is higher than two
standard deviations above the mean frequency of occurrence.
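
By way of illustration only, the dictionary-building steps 110 and 120 and the
frequency pruning described above might be sketched in Python as follows. The
stop-word list, the toy suffix-stripping `stem` function and all function names
are assumptions made for this sketch; the description does not specify a
particular stemming algorithm.

```python
import re
import statistics
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (the text only gives examples).
STOP_WORDS = {"a", "the", "however", "about", "and"}

def stem(word):
    """Toy suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenise(text):
    """Lower-case, split into words, drop stop-words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_dictionary(documents, n_sigma=2.0):
    """Count stems over the whole collection and prune frequency outliers."""
    counts = Counter()
    for text in documents:
        counts.update(tokenise(text))
    values = list(counts.values())
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)
    low, high = mean - n_sigma * sigma, mean + n_sigma * sigma
    # Keep only terms whose collection-wide count lies within the band.
    return sorted(term for term, c in counts.items() if low <= c <= high)
```

For a heavily skewed count distribution the lower bound can fall below zero, in
which case only the high-frequency cut has any effect; a practical system might
therefore add an explicit minimum count.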

[0030] A feature vector is then generated at a step
130. 

[0031] To do this, a term frequency histogram is generated
for each document in the set. A term frequency
histogram is constructed by counting the number of
times words present in the dictionary (pertaining to that
document set) occur within an individual document. The
majority of the terms in the dictionary will not be present
in a single document, and so these terms will have a
frequency of zero. Schematic examples of term frequency
histograms for two different documents are shown in
Figures 3a and 3b.

[0032] It can be seen from this example how the histograms
characterise the content of the documents. By
inspecting the examples it is seen that document 1 has
more occurrences of the terms "MPEG" and "Video"
than document 2, which itself has more occurrences of
the term "MetaData". Many of the entries in the histogram
are zero as the corresponding words are not
present in the document.

[0033] In a real example, the actual term frequency
histograms have a very much larger number of terms in
them than the example. Typically a histogram may plot
the frequency of over 50000 different terms, giving the
histogram a dimension of over 50000. The dimension of
this histogram needs to be reduced considerably if it is
to be of use in building an SOM information space.
[0034] Each entry in the term frequency histogram is
used as a corresponding value in a feature vector representing
that document. The result of this process is a
(50000 x 1) vector containing the frequency of all terms
specified by the dictionary for each document in the document
collection. The vector may be referred to as
"sparse" since most of the values will typically be zero,
with most of the others typically being a very low number
such as 1.
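
As a minimal sketch of the step just described, a term frequency vector can be
formed by counting, in dictionary order, how often each dictionary term occurs
in one document. The three-term dictionary, the token list and the function
name below are hypothetical and chosen only to keep the example small.

```python
from collections import Counter

def term_frequency_vector(tokens, dictionary):
    """Return one count per dictionary term, in dictionary order."""
    counts = Counter(tokens)
    # Terms absent from the document get a frequency of zero.
    return [counts[term] for term in dictionary]

dictionary = ["metadata", "mpeg", "video"]      # hypothetical dictionary
doc1 = ["mpeg", "video", "video", "codec"]      # "codec" is not in the dictionary
vector1 = term_frequency_vector(doc1, dictionary)
```

Words that are not in the dictionary are simply ignored. With a dictionary of
some 50000 terms, a sparse representation holding only the non-zero counts
would normally be preferable to a full list.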

[0035] The size of the feature vector, and so the dimension
of the term frequency histogram, is reduced at
a step 140. Two methods are proposed for the process
of reducing the dimension of the histogram.

i) Random Mapping - a technique by which the histogram
is multiplied by a matrix of random numbers.
This is a computationally cheap process.

ii) Latent Semantic Indexing - a technique whereby
the dimension of the histogram is reduced by looking
for groups of terms that have a high probability
of occurring simultaneously in documents. These
groups of words can then be reduced to a single
parameter. This is a computationally expensive
process.

[0036] The method selected for reducing the dimension
of the term frequency histogram in the present embodiment
is "random mapping", as explained in detail in
the Kaski paper referred to above. Random mapping
succeeds in reducing the dimension of the histogram by
multiplying it by a matrix of random numbers.
[0037] As mentioned above, the "raw" feature vector
(shown schematically in Figure 4a) is typically a sparse
vector with a size in the region of 50000 values. This
can be reduced to a size of about 200 (see schematic
Figure 4b) and still preserve the relative characteristics
of the feature vector, that is to say, its relationship such
as relative angle (vector dot product) with other similarly
processed feature vectors. This works because although
the number of orthogonal vectors of a particular
dimension is limited, the number of nearly orthogonal
vectors is very much larger.
[0038] In fact, as the dimension of the vectors increases,
the vectors of any given randomly generated set are nearly
orthogonal to each other. This property means that the
relative direction of vectors multiplied by this matrix of
random numbers will be preserved. This can be demonstrated
by showing the similarity of vectors before and
after random mapping by looking at their dot product.
[0039] It can be shown experimentally that reducing
sparse vectors from 50000 values to 200 values preserves
their relative similarities. This mapping
is not perfect, but it suffices for the purposes of characterising
the content of a document in a compact way.
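
The following is a hedged sketch of random mapping using only the Python
standard library. The Gaussian entries, the 1/sqrt(m) scaling, the dimensions
n = 1000 and m = 200, and the two test vectors are assumptions chosen for the
illustration; the description itself only requires "a matrix of random
numbers". The sketch compares the cosine of the angle between two sparse
vectors before and after projection.

```python
import math
import random

def random_projection_matrix(n, m, seed=0):
    """n x m matrix of Gaussian random numbers, scaled by 1/sqrt(m)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(m)] for _ in range(n)]

def project(sparse_vec, matrix):
    """Multiply a sparse row vector {index: value} by the projection matrix."""
    m = len(matrix[0])
    out = [0.0] * m
    for i, value in sparse_vec.items():
        row = matrix[i]
        for j in range(m):
            out[j] += value * row[j]
    return out

def densify(sparse_vec, n):
    """Expand {index: value} into a full list of n values."""
    return [sparse_vec.get(i, 0.0) for i in range(n)]

def cosine(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

n, m = 1000, 200
matrix = random_projection_matrix(n, m)
u = {i: 1.0 for i in range(0, 40)}    # two sparse vectors sharing 20 terms
v = {i: 1.0 for i in range(20, 60)}
before = cosine(densify(u, n), densify(v, n))
after = cosine(project(u, matrix), project(v, matrix))
print(f"cosine before: {before:.3f}, after: {after:.3f}")
```

The two printed values should be close but not identical, which reflects the
statement above that the mapping is imperfect but adequate.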
[0040] Once feature vectors have been generated for
the document collection, thus defining the collection's
information space, they are projected into a two-dimensional
SOM at a step 150 to create a semantic map. The
following section explains the process of mapping to 2-D
by clustering the feature vectors using a Kohonen
self-organising map. Reference is also made to Figure 5.
[0041] A Kohonen Self-Organising map is used to 
cluster and organise the feature vectors that have been 
generated for each of the documents. 
[0042] A self-organising map consists of input nodes
170 and output nodes 180 in a two-dimensional array
or grid of nodes illustrated as a two-dimensional plane
185. There are as many input nodes as there are values
in the feature vectors being used to train the map. Each
of the output nodes on the map is connected to the input
nodes by weighted connections 190 (one weight per
connection).

[0043] Initially each of these weights is set to a ran- 
dom value, and then, through an iterative process, the 
weights are "trained". The map is trained by presenting 
each feature vector to the input nodes of the map. The 
"closest" output node is calculated by computing the Eu- 
clidean distance between the input vector and weights 
of each of the output nodes. 

[0044] The closest node is designated the "winner" 
and the weights of this node are trained by slightly 
changing the values of the weights so that they move 
"closer" to the input vector. In addition to the winning 
node, the nodes in the neighbourhood of the winning 
node are also trained, and moved slightly closer to the 
input vector. 

[0045] It is this process of training not just the weights
of a single node, but the weights of a region of nodes
on the map, that allows the map, once trained, to preserve
much of the topology of the input space in the 2-D
map of nodes.
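
The training procedure of the preceding paragraphs can be sketched as below.
This is an illustrative reading only: the linearly decaying learning rate, the
Gaussian neighbourhood function, the shrinking radius and every parameter
value are assumptions, since the description says only that the winner and its
neighbours are moved "slightly closer" to the input vector.

```python
import math
import random

def nearest_node(weights, vec):
    """Return ((x, y), distance) for the output node closest to vec."""
    best, best_d = None, float("inf")
    for pos, w in weights.items():
        d = math.dist(w, vec)          # Euclidean distance
        if d < best_d:
            best, best_d = pos, d
    return best, best_d

def train_som(vectors, width, height, epochs=50, rate=0.5, seed=0):
    """Train a width x height map; returns {(x, y): weight list}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Each output node starts with random weights, one per input value.
    weights = {(x, y): [rng.random() for _ in range(dim)]
               for x in range(width) for y in range(height)}
    radius0 = max(width, height) / 2.0
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = rate * frac                       # assumed linear decay
        rad = max(radius0 * frac, 0.5)         # assumed shrinking neighbourhood
        for vec in vectors:
            (wx, wy), _ = nearest_node(weights, vec)   # the "winner"
            for (x, y), w in weights.items():
                grid_d2 = (x - wx) ** 2 + (y - wy) ** 2
                influence = math.exp(-grid_d2 / (2.0 * rad * rad))
                for k in range(dim):
                    # Move the winner and its neighbours towards the input.
                    w[k] += lr * influence * (vec[k] - w[k])
    return weights
```

Because nodes near the winner on the grid are updated together, neighbouring
nodes end up with similar weights, which is what lets the trained map preserve
much of the topology of the input space.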

[0046] Once the map is trained, each of the documents
can be presented to the map to see which of the
output nodes is closest to the input feature vector for
that document. It is unlikely that the weights will be identical
to the feature vector, and the Euclidean distance
between a feature vector and its nearest node on the
map is known as its "quantisation error".
[0047] Presenting the feature vector for each document
to the map to see where it lies yields an x, y map
position for each document. These x, y positions, when
put in a look-up table along with a document ID, can be
used to visualise the relationship between documents.
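
A minimal sketch of the look-up table described above is given here. The
two-node map, the document IDs and the function name are hypothetical; the
table simply records, for each document ID, the x, y position of the closest
output node together with the quantisation error.

```python
import math

def place_documents(weights, doc_vectors):
    """Build {doc_id: ((x, y), quantisation_error)} for a trained map."""
    table = {}
    for doc_id, vec in doc_vectors.items():
        # The closest output node by Euclidean distance.
        pos = min(weights, key=lambda p: math.dist(weights[p], vec))
        table[doc_id] = (pos, math.dist(weights[pos], vec))
    return table

# Hypothetical two-node "trained" map and two documents.
weights = {(0, 0): [0.0, 0.0], (1, 0): [1.0, 1.0]}
docs = {"doc-a": [0.1, 0.0], "doc-b": [1.0, 0.8]}
lookup = place_documents(weights, docs)
```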
[0048] Finally, a dither component is added at a step
160, which will be described with reference to Figure 6
below.

[0049] A potential problem with the process described
above is that two identical, or substantially identical,
information items may be mapped to the same node in
the array of nodes of the SOM. This does not cause a
difficulty in the handling of the data, but does not help
with the visualisation of the data on a display screen (to
be described below). In particular, when the data is visualised
on a display screen, it has been recognised that
it would be useful for multiple very similar items to be
distinguishable over a single item at a particular node.
Therefore, a "dither" component is added to the node
position to which each information item is mapped. The
dither component is a random addition of up to ±½ of
the node separation. So, referring to Figure 6, an information
item for which the mapping process selects an
output node 200 has a dither component added so that
it in fact may be mapped to any node position within the
area 210 bounded by dotted lines on Figure 6.
[0050] So, the information items can be considered to 
map to positions on the plane of Figure 6 at node posi- 
tions other than the "output nodes" of the SOM process. 
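
The dither step might be sketched as follows. The uniform distribution and the
default `amount` of half the node separation are assumptions made for this
illustration, as is the convention that adjacent output nodes are one unit
apart.

```python
import random

def dither(position, amount=0.5, rng=random):
    """Offset an (x, y) node position by a random amount in each axis.

    amount is expressed as a fraction of the node separation, which is
    taken to be 1.0 here (an assumption of this sketch).
    """
    x, y = position
    return (x + rng.uniform(-amount, amount),
            y + rng.uniform(-amount, amount))

# Three identical items mapped to the same output node (3, 4)
# receive distinct display positions.
rng = random.Random(1)
positions = [dither((3, 4), rng=rng) for _ in range(3)]
```

Because the offset never exceeds half the node separation in either axis, a
dithered item stays nearer to its own output node than to any neighbouring
node.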
[0051] An alternative approach might be to use a 
much higher density of "output nodes" in the SOM map- 
ping process described above. This would not provide 
any distinction between absolutely identical information 
items, but may allow almost, but not completely, identi- 
cal information items to map to different but closely 
spaced output nodes. 

[0052] Figure 7 schematically illustrates a display on 
the display screen 60 in which data sorted into an SOM 
is graphically illustrated for use in a searching operation. 
The display shows a search enquiry 250, a results list 
260 and an SOM display area 270. 
[0053] In operation, the user types a key word search
enquiry into the enquiry area 250. The user then initiates 
the search, for example by pressing enter on the key- 
board 70 or by using the mouse 80 to select a screen 
"button" to start the search. The key words in the search 
enquiry box 250 are then compared with the information 
items in the database using a standard keyword search 
technique. This generates a list of results, each of which 
is shown as a respective entry 280 in the list view 260. 
Also, each result has a corresponding display point on 
the node display area 270. 

[0054] Because the sorting process used to generate 
the SOM representation tends to group mutually similar 
information items together in the SOM, the results for 
the search enquiry generally tend to fall in clusters such 
as a cluster 290. Here, it is noted that each point on the 
area 270 corresponds to the respective entry in the SOM 
associated with one of the results in the result list 260; 
and the positions at which the points are displayed with- 
in the area 270 correspond to the array positions of 
those nodes within the node array. 
[0055] Figure 8 schematically illustrates a technique
for reducing the number of "hits" (results in the result
list). The user makes use of the mouse 80 to draw a box
300 around a set of display points corresponding to
nodes of interest. In the results list area 260, only those
results corresponding to points within the box 300 are
displayed. If these results turn out not to be of interest,
the user may draw another box encompassing a different
set of display points.

[0056] It is noted that the results area 260 displays list
entries for those results for which display points are displayed
within the box 300 and which satisfied the search
criteria in the word search area 250. The box 300 may
encompass other display positions corresponding to
populated nodes in the node array, but if these did not
satisfy the search criteria they will not be displayed and
so will not form part of the subset of results shown in the
box 260.
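
As an illustration of the detector and the filtered results list, the sketch
below keeps only those search hits whose display points fall inside a
user-drawn box. The `DisplayPoint` class, the function names and the
coordinate values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DisplayPoint:
    doc_id: str
    x: float
    y: float

def points_in_box(points, x0, y0, x1, y1):
    """Return the display points lying within the box, whatever the drag direction."""
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return [p for p in points if left <= p.x <= right and top <= p.y <= bottom]

def filtered_results(search_hits, points, box):
    """Keep the hits, in their original order, whose points are inside the box."""
    inside = {p.doc_id for p in points_in_box(points, *box)}
    return [doc_id for doc_id in search_hits if doc_id in inside]

points = [DisplayPoint("a", 1, 1), DisplayPoint("b", 5, 5),
          DisplayPoint("c", 2, 2), DisplayPoint("d", 1, 2)]
hits = ["c", "a", "b"]                 # "d" did not satisfy the search
shown = filtered_results(hits, points, (3, 3, 0, 0))
```

Item "d" lies inside the box but is not a search hit, so it is not listed,
which mirrors the behaviour described in the preceding paragraph.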

[0057] Figure 9 schematically illustrates a technique
for detecting the node position of an entry in the list view
260. Using a standard technique in the field of graphical
user interfaces, particularly in computers using the
so-called "Windows" ™ operating system, the user may
"select" one or more of the entries in the results list view.
In the examples shown, this is done by a mouse click
on a "check box" 310 associated with the relevant results.
However, it could equally be done by clicking to
highlight the whole result, or by double-clicking on the
relevant result and so on. As a result is selected, the
corresponding display point representing the respective
node in the node array is displayed in a different manner.
This is shown schematically for two display points 320
corresponding to the selected results 330 in the results
area 260.

[0058] The change in appearance might be a display
of the point in a larger size, or in a more intense version
of the same display colour, or in a different display colour,
or in a combination of these varying attributes.
[0059] At any time, a new information item can be
added to the SOM by following the steps outlined above
(i.e. steps 110 to 140) and then applying the resulting
reduced feature vector to the "pre-trained" SOM models,
that is to say, the set of SOM models which resulted
from the self-organising preparation of the map. So, for
the newly added information item, the map is not generally
"retrained"; instead steps 150 and 160 are used
with all of the SOM models not being amended. To retrain
the SOM every time a new information item is to
be added is computationally expensive and is also
somewhat unfriendly to the user, who might grow used
to the relative positions of commonly accessed information
items in the map.

[0060] However, there may well come a point at which
a retraining process is appropriate. For example, if new
terms (perhaps new items of news, or a new technical
field) have entered into the dictionary since the SOM
was first generated, they may not map particularly well
to the existing set of output nodes. This can be detected
as an increase in a so-called "quantisation error" detected
during the mapping of a newly received information
item to the existing SOM. In the present embodiments,
the quantisation error is compared to a threshold error
amount. If it is greater than the threshold amount then
either (a) the SOM is automatically retrained, using all
of its original information items and any items added
since its creation; or (b) the user is prompted to initiate
a retraining process at a convenient time. The retraining
process uses the feature vectors of all of the relevant
information items and reapplies the steps 150 and 160
in full.
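
The threshold test described above could be sketched as follows. The function
name, the example weights and the threshold value are hypothetical; the sketch
maps a new reduced feature vector onto unchanged, pre-trained weights and
reports whether its quantisation error exceeds the threshold.

```python
import math

def add_item(weights, vec, threshold):
    """Map vec onto fixed, pre-trained weights and flag a possible retrain.

    Returns (node position, quantisation error, retrain_suggested).
    """
    pos = min(weights, key=lambda p: math.dist(weights[p], vec))
    error = math.dist(weights[pos], vec)
    return pos, error, error > threshold

# Hypothetical pre-trained map with two output nodes.
weights = {(0, 0): [0.0, 0.0], (1, 0): [1.0, 1.0]}
familiar = add_item(weights, [0.9, 1.0], threshold=0.5)
novel = add_item(weights, [3.0, -2.0], threshold=0.5)
```

What happens when the flag is set, whether automatic retraining or a prompt to
the user, is left to the caller, in line with options (a) and (b) above.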

[0061] Figure 10 schematically illustrates a camcorder
500 as an example of a video acquisition and/or
processing apparatus, the camcorder including an image
capture device 510 with an associated lens 520; a
data/signal processor 530; tape storage 540; disk or other
random access storage 550; user controls 560; and
a display device 570 with eyepiece 580. Other features
of conventional camcorders or other alternatives (such
as different storage media or different display screen
arrangements) will be apparent to the skilled man. In use,
MetaData relating to captured video material may be
stored on the storage 550, and an SOM relating to the
stored data viewed on the display device 570 and controlled
as described above using the user controls 560.
[0062] Figure 11 schematically illustrates a personal
digital assistant (PDA) 600, as an example of portable
data processing apparatus, having a display screen 610
including a display area 620 and a touch sensitive area
630 providing user controls; along with data processing
and storage (not shown). Again, the skilled man will be
aware of alternatives in this field. The PDA may be used
as described above in connection with the system of
Figure 1.



Claims

1. An information retrieval system in which a set of distinct
information items map to respective nodes in
an array of nodes by mutual similarity of the information
items, so that similar information items map
to nodes at similar positions in the array of nodes;
the system comprising:

a graphical user interface for displaying a representation
of at least some of the nodes as a
two-dimensional display array of display points
within a display area on a user display;
a user control for defining a two-dimensional region
of the display area; and
a detector for detecting those display points lying
within the two-dimensional region of the display
area;
the graphical user interface also displaying a
list of data representing information items, being
those information items mapped onto nodes
corresponding to display points displayed within
the two-dimensional region of the display area.

2. A system according to claim 1, in which the information
items are mapped to nodes in the array on
the basis of a feature vector derived from each
information item.

3. A system according to claim 2, in which the feature
vector for an information item represents a set of
frequencies of occurrence, within that information
item, of each of a group of information features.

4. A system according to claim 3, in which the information
items comprise textual information, the feature
vector for an information item represents a set
of frequencies of occurrence, within that information
item, of each of a group of words.

5. A system according to claim 1 or claim 2, in which
the information items comprise textual information,
the nodes being mapped by mutual similarity of at
least a part of the textual information.

6. A system according to claim 4 or claim 5, in which
the information items are pre-processed for mapping
by excluding words occurring with more than
a threshold frequency amongst the set of information
items.

7. A system according to any one of claims 4 to 6, in
which the information items are pre-processed for
mapping by excluding words occurring with less
than a threshold frequency amongst the set of
information items.

8. A system according to any one of claims 4 to 7,
comprising:

search means for carrying out a word-related
search of the information items;
the search means and the graphical user interface
being arranged to co-operate so that only
those display points corresponding to information
items selected by the search are displayed.

9. A system according to any one of the preceding
claims, in which the mapping between information
items and nodes in the array includes a dither component
so that substantially identical information
items tend to map to closely spaced but different
nodes in the array.

10. A system according to any one of the preceding
claims, comprising a user control for choosing one
or more information items from the list; the graphical
user interface being operable to alter the manner of
display within the display area of display points
corresponding to selected information items.

11. A system according to claim 10, in which the graphical
user interface is operable to display in a different
colour and/or intensity those display points
corresponding to information items chosen within the
list.

12. An information storage system in which a set of dis- 
tinct information items are processed so as to map 
to respective nodes in an array of nodes by mutual 
similarity of the information items, such that similar 
information items map to nodes at similar positions 
in the array of nodes; the system comprising: 

means for generating a feature vector derived
from each information item, the feature vector
for an information item representing a set of frequencies
of occurrence, within that information
item, of each of a group of information features;
and

means for mapping each feature vector to a 
node in the array of nodes, the mapping be- 
tween information items and nodes in the array 
including a dither component so that substan- 
tially identical information items tend to map to 
closely spaced but different nodes in the array. 

13. A system according to claim 12, comprising: 

means for mapping a newly received informa- 
tion item to a node in the array of nodes; 
means for detecting a mapping error as the 
newly received information item is so mapped; 
and 

means responsive to a detection that the map- 
ping error exceeds a threshold error amount, 
for initiating a remapping process of the set of 
information items and the newly received infor- 
mation item. 

14. A portable data processing device comprising a 
system according to any one of the preceding 
claims. 

15. Video acquisition and/or processing apparatus 
comprising a system according to any one of the 
preceding claims. 

16. An information storage method in which a set of dis- 
tinct information items are processed so as to map 
to respective nodes in an array of nodes by mutual 
similarity of the information items, such that similar 
information items map to nodes at similar positions 
in the array of nodes; the method comprising the 
steps of: 

generating a feature vector derived from each
information item, the feature vector for an information
item representing a set of frequencies of
occurrence, within that information item, of
each of a group of information features; and

mapping each feature vector to a node in the
array of nodes, the mapping between information
items and nodes in the array including a
dither component so that substantially identical
information items tend to map to closely spaced
but different nodes in the array.

17. An information retrieval method in which a set of distinct
information items map to respective nodes in
an array of nodes by mutual similarity of the information
items, so that similar information items map
to nodes at similar positions in the array of nodes;
the method comprising:

displaying a representation of at least some of
the nodes as a two-dimensional display array
of display points within a display area on a user
display;

defining, with a user control, a two-dimensional
region of the display area;

detecting those display points lying within the
two-dimensional region of the display area; and

displaying a list of data representing information
items, being those information items
mapped onto nodes corresponding to display
points displayed within the two-dimensional
region of the display area.

18. Computer software having program code for carrying
out a method according to any one of claims 16
and 17.

19. A providing medium for providing program code according
to claim 18.

20. A medium according to claim 19, the medium being
a storage medium.

21. A medium according to claim 19, the medium being
a transmission medium.